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Chapter  5 


Abstract 


The  calculation  of  the  discrete  Fourier  transform  (DFT)  has  long  been  a  significant 

V'  * 

bottleneck  in  many  Digital  Signal  Processing  applications.  This  research  effort' con tri- 
butes  to  the  goal  of  implementing  a  very  large-scale  integrated  (VLSI)  circuit  which 
uses  the  Winograd  and  Good-Thomas  algorithms  for  computing  DFTs  with  composite 
blocklengths.  Winograd  processors  use  both  the  small  and  large  Winograd  algorithms 
to  compute  DFTs  with  blocklengths  of  15,  16,  and  17.  Longer  blocklength  DFTs  (210. 
255,  272,  and  4080  are  computed  using  a  pipeline  of  Winograd  processors,  dula-port 
memories,  and  a  controller.  The  pipeline  uses  the  Good-Thomas  Prime  Factor  Algo¬ 
rithm  (PFA).  Fault  tolerance  was  enhanced  by  the  use  of  watchdog  processors  to  pro- 

i 

vide  concurrent  fault  detection  and  error  control  coding(ECC)  to  provide  fault  mask¬ 
ing. 

A  pipeline  controller  was  designed  as  a  Reduced  Instruction  Set  Computer  (RISC) 
to  control  the  pipeline  and  regulated  the  flow  of  data.  Also,  a  dual  ported,  double 
buffered  RAM  memory  was  designed  to  provide  intra-pipeline  data  storage. 


DESIGN  OF  FAULT  TOLERANT 


PRIME  FACTOR  ALGORITHM 
ARRAY  ELEMENTS 


CHAPTER  1 

INTRODUCTION 


1.1.  Background 

Many  of  the  problems  of  digital  signal  processing  involve  computation  of  the 
Discrete  Fourier  Transform  (DFT).  These  problems  need  spectral  information,  given  by 
the  DFT,  on  a  large  number  of  sample  points  to  obtain  the  necessary  spectral  resolu¬ 
tion  for  precise  calculation  of  operations  such  as  correlation  and  convolution.  Current 
applications  for  rapid  spectrum  analysis  include  radar,  sonar,  pattern  recognition,  voice 
processing,  target  acquisition  systems,  and  artificial  intelligence. 

With  the  introduction  of  large  scale  integration  of  digital  systems,  it  has  become 
practical  to  implement  complex  signal  processing  functions  on  a  single  chip.  Future 
applications  include  not  only  enhancements  of  current  implementations  but  also  many 
potential  applications  not  feasible  due  to  speed  and  size  limitations  of  today's  technol¬ 
ogy.  Advances  in  hardware  technology  must  be  matched  with  clever  algorithmic 
designs  to  handle  the  computational  tasks.  Although  new  DFT  algorithms  and  systems 
architectures  are  constantly  being  developed,  the  most  advanced  systems  available 
today  are  still  not  able  to  satisfy  all  the  military  requirements  for  compact,  high  speed 
digital  signal  processing  systems  that  utilize  the  DFT 
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Therefore,  the  Air  Force  Office  of  Scientific  Research,  DARPA  -  Defense  Advanced 
Research  Projects  Agency,  and  Air  Force  Space  Division  are  sponsoring  research  for  the 
development  of  a  high  speed  DFT  processor.  To  achieve  a  more  advanced  DFT  pro¬ 
cessing  system  which  meets  the  Air  Force  requirements,  new  transform  algorithms  and 
hardware  architectures  must  be  combined  to  simultaneously  increase  data  processing 
throughput  and  decrease  size  and  weight.  One  method  of  increasing  the  throughput  is 
to  reduce  the  number  of  operations  required  to  compute  the  DFT.  Winograd  has 
shown  a  class  of  algorithms  (Winograd  Fourier  Transform  Algorithms,  WFTA)  to  use 
the  fewest  number  of  multiplications  in  computing  the  DFT  [Wi78].  Reducing  the 
number  of  multiplications  gives  a  larger  increase  in  performance  than  reducing  the 
number  of  additions  because  the  multiplications  are  more  complicated,  requiring  several 
additions  to  yield  the  product. 

The  Winograd  DFT  algorithm  is  of  interest  in  VLSI  because  the  matrix  form  of 
the  algorithm  maps  efficiently,  in  terms  of  space  and  regularity  of  structure,  into  a  sig¬ 
nal  processing  architecture.  Also,  large  data  block  lengths  may  be  computed  by  com¬ 
bining  various  Winograd  modules  into  a  pipelined  architecture  in  the  manner  specified 
by  the  Good-Thomas  Prime  Factor  algorithm.  The  Good-Thomas  Prime  Factor  Algo¬ 
rithm  (PFA)  uses  relatively  prime  modules  to  compute  true  multi-dimensional  DFTs. 
The  4080-point  processor  was  chosen  as  a  representative  system.  The  408Opoint  PFA 
processor  uses  Winograd  modules  of  lengths  15,16,  and  17.  The  layout  of  the  4080 
point  PFA  processor  is  shown  in  Figure  1-1.  There  are  separate  chips  for  each  Wino¬ 
grad  processor,  for  each  dual-port  memory,  and  for  the  PFA  controller.  Systems  for 
computing  DFTs  with  block  lengths  of  240,  255,  and  272  points  will  use  one  less  Wino¬ 
grad  processor  and  dual-port  memory. 

In  the  figure  the  memories  are  divided  in  two  parts  with  hashing  to  illustrate  the 
symmetrical  operations  of  the  memory  accesses.  At  any  point  in  time  the  pipeline  is 
structured  so  that  all  reads  are  accomplished  on  memory  banks  with  one  hashing  and 
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Figure  1-1  4080-POINT  PFA  PROCESSOR 


all  writes  are  accomplished  on  the  memory  banks  with  the  other  type  of  hashing. 

1.2.  Problem 

To  implement  a  4080-point  PFA  pipeline  certain  aspects  must  be  considered 

1.  WFTA  processors 

2.  A  Reduced  Instruction  Set  Computer  (RISC)  for  PFA  pipeline  control 

3.  High  Speed  RAM  with  Error  Correction  Circuitry  for  the  PFA  pipeline 

4.  Clock  Generation 

•V  Hybrid  Circuit/Board  Design 

This  thesis  will  be  concerned  with  the  design  of  a  pipeline  controller  to  control  tin 
pipeline  and  regulate  the  flow  of  data.  Also,  an  intermediate  data  storage  medium 


Random  Access  Memories,  will  be  implemented  to  facilitate  the  control  of  the  data 
flow.  This  thesis  will  promote  fault  tolerance  in  the  structure  of  the  pipeline  and  its 
management.  Also,  fault  tolerance  will  be  enhanced  by  memory  error  detection  and 
correction  circuitry. 

1.3.  Scope 

1.3.1.  Past  Theses. 

Four  past  theses  addressed  other  aspects  of  the  PFA  pipeline  design.  Taylor 
presented  the  PFA  and  WFTA  theory,  overall  signal  processing  systems  architecture, 
and  numerical  precision  simulation  results  [Ta85j.  Coutee  presented  the  arithmetic  cir¬ 
cuitry  for  the  16  point  WFT  processor  (WFTA16)  [Co85j.  Rossbach  presented  the  con¬ 
trol  for  the  high  speed  Winograd  DFT  processor  [Ro85j.  Collins  presented  a  validation 
program  for  WFTA16  processor  operations,  and  described  the  WFTA16  modules  in 
VHSIC  (Very  High  Speed  Integrated  Circuit)  Hardware  Description  Language  (VHDL) 
(Co85 . |. 

1.3.2.  Parallel  Theses. 

Shephard  presents  a  design  for  testability  of  the  WTTA  16  processor,  integrates 
the  macro  cells  designed  by  Coutee  and  Rossbach  and  completes  the  design  of  the 
WFTA16  processer  [Sh86j.  Cooper  presents  a  description  of  WFTA  modules  in  VHDL 
Co86|. 

1.3.3.  Scope  of  This  Thesis.  The  scope  of  this  thesis  is  to  complete  the  development 
of  a  fault  tolerant  PFA  pipeline  processor  by  specifying  the  architecture  of  the  PFA  controller: 
designing  a  high  speed  RAM  memory  with  error  correction  circuitry  (ECC)  for  the  PFA  pipeline; 
and  prototyping  the  ECC  chip  in  a  three-micron  CMOS  technology. 


1.3.4.  Future  Theses. 


Clock  generation  and  Hybrid  Circuit/Board  Design  will  be  addressed  in  future 
theses. 


1.4.  Summary  of  Current  Knowledge 
1.4.1.  CMOS  Technology. 

CMOS  (Complementary  Metal  Oxide  Silicon)  technology  is  recognized  as  a  leading 
contender  for  existing  and  future  VLSI  systems.  CMOS  provides  an  inherently  low 
power  static  circuit  technology  that  has  the  capability  of  providing  a  lower  power-delay 
product  than  NMOS  or  PMOS  technologies  using  comparable  design-rules  [We85], 
The  PFA  pipeline  components  are  being  implemented  in  CMOS  technology.  When  dis¬ 
cussing  the  performance  characteristics  of  CMOS  circuits,  the  main  points  to  observe 
are  that  a  CMOS  gate  consumes  no  DC  power  when  the  output  is  a  "one"  or  "zero" 
level  and  the  output  logic  levels  of  CMOS  are  fully  restored. 

Two  basic  circuits  used  are  the  transmission  gate  and  the  inverter.  The  transmis¬ 
sion  gate  consists  of  a  PMOS  transistor  in  parallel  with  an  NMOS  transistor.  The 
NMOS  transistor  passes  a  weak  "one"  and  a  strong  "zero”,  while  the  PMOS  transistor 
passes  a  weak  "zero"  and  a  strong  "one".  Since  the  PMOS  transistor  is  in  parallel  with 
the  NMOS  transistor,  the  transmission  gate  passes  a  strong  "one”  and  a  strong  "zero". 
Therefore  the  transmission  gate  closely  approximates  an  ideal  switch.  The  inverter  is 
also  built  to  ensure  that  a  strong  signal  is  always  output.  The  inverter  outputs  Vdd  (5 
volts)  when  "zero"  is  input  and  outputs  Vss  (GND)  when  a  "one"  is  input.  Only  one 
MOS  transistor  is  turned  on  when  the  inverter  is  in  either  state.  Therefore,  there  is  no 
direct  current  flow  when  the  inverter  is  not  changing  state  and  no  static  power  dissipa¬ 
tion  except  for  a  small  amount  due  to  junction  leakage  During  switching  both  transis¬ 
tors  are  on  and  current  is  drawn  only  while  the  input  voltage  passes  from  the  threshold 
voltage  for  the  NMOS  transistor  to  Vdd  minus  the  threshold  voltage  for  the  PMOS 
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transistor. 


CMOS  circuits  are  susceptible  to  latch-up  because  of  the  presence  of  a  p-n-p-n 
structure.  The  circuit  designs  and  I/O  pad  designs  of  the  WFTA  chips  are  designed  to 
protect  against  latch- up  [Li86j. 

1.4.2.  PFA  and  WFTA. 

1.4.2. 1.  Small/Large  WFTA’s  PFA. 

Constructing  a  DFT  using  Winograd’s  small  DFT  algorithm  requires  knowledge  of 
Rader’s  prime  algorithm  [Ra68]  and  Winograd’s  short  convolution  algorithms  [Wi78], 
Algorithms  are  presented  in  great  detail  by  Blahut  [B185]  and  McClellan  and  Rader 
[Mc79],  Winograd’s  Large  DFT  algorithm  produces  a  longer  block  length  by  combining 
relatively  prime  small  Winograd  modules  to  yield  a  block  length  equal  to  the  product  of 
the  block  lengths  of  the  small  Winograd  modules.  The  Winograd  modules  provide  an 
efficient  means  of  computing  short  and  medium  length  DFTs.  One  of  the  drawbacks  of 
using  the  large  Winograd  modules  is  the  matrices  become  large  and  unwieldy,  imposing 
serious  memory  constraints  for  DFT  computation  (eg.,  a  255-point  transform  com¬ 
puted  using  the  large  Winograd  algorithm  requires  a  multiplication  matrix  of  size  1280). 
The  Good-Thomas  prime  factor  algorithm  requires  more  overall  multiplications  than 
the  large  Winograd  algorithm.  But,  since  the  DFT  is  broken  into  several  component 
parts,  the  total  number  of  multiplications  is  resolved  into  several  subproblems  (e  g.,  a 
255-point  DFT  computed  using  the  Good-Thomas  PFA  requires  multiplication  matrices 
of  size  18  and  36  for  the  15-point  module  and  the  17-point  module,  respectively;  for  a 
total  number  of  1900  multiplications  [Ta85l. 

1.4. 2. 2.  Fault  Tolerance. 

There  are  two  fundamentally  different  approaches  that  can  be  taken  to  increase 
the  reliability  of  computing  systems.  The  first  approach  is  called  fault  prevention  and 
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the  second  approach  is  called  fault  tolerance.  In  the  traditional  fault  prevention 
approach,  the  objective  is  to  increase  the  reliability  by  a  prior  elimination  of  all  faults. 
Although  semiconductor  manufacturers  try  to  ensure  that  their  products  are  reliable,  it 
is  almost  impossible  to  eliminate  any  chance  faults  somewhere  in  a  system.  The  reliabil¬ 
ity  of  a  system  can  be  increased  by  employing  the  method  of  worst  case  design,  using 
high  quality  components  and  imposing  strict  quality  control  procedures  during  the 
assembly  phase.  However,  such  measures  can  increase  the  cost  of  a  system  significantly. 
An  alternative  approach  to  reliable  system  design  is  to  incorporate  redundancy  into  a 
system  with  the  aim  of  masking  the  effects  of  faults.  This  approach  does  not  necessi¬ 
tate  the  use  of  high-quality  components,  instead  standard  components  can  be  used  in  a 
redundant  and  reconfigurable  architecture.  In  view  of  the  decreasing  cost  of  hardware 
components  it  is  certainly  less  expensive  to  use  the  second  approach  to  design  reliable 
systems.  This  second  approach  is  called  fault  tolerance. 

Fault  tolerance  is  the  ability  of  a  system  to  operate  in  the  presence  of  errors. 
Fault  tolerance  includes  fault  avoidance,  fault  detection,  and  system  recovery  Fault 
avoidance  is  an  attempt  to  reduce  the  number  of  faults  by  careful  component  design 
[Si84]  , 

Digital  systems,  even  when  designed  with  highly  reliable  components,  do  not 
operate  forever  without  developing  some  faults.  When  a  system  ultimately  does 
develop  a  fault,  the  fault  has  to  be  detected  and  located  so  that  the  effect  of  the  fault 
can  be  removed.  Fault  detection  means  the  discovery  of  something  wrong  in  a  digital 
systems  or  circuit.  A  fault  may  or  may  not  cause  a  failure.  A  failure  is  said  to  have 
occurred  if  it  deviates  from  it’s  specified  behavior  A  fault  may  be  permanent  or  tem¬ 
porary.  A  major  portion  of  digital  system  malfunctions  are  caused  by  temporary  faults 
Transient  faults  are  non-recurring  temporary  faults.  They  are  usually  caused  by  alpha 
particle  radiation  or  power  supply  fluctuation,  and  they  are  not  repairable  because 
there  is  no  physical  damage  to  the  hardware.  They  are  the  major  source  of  problems  in 
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semiconductor  memory  chips. 


Intermittent  faults  are  recurring  faults  that  reappear  on  an  irregular  basis.  Such 
faults  can  occur  due  to  loose  connections,  partially  defective  components,  poor  designs, 
or  environmental  conditions  such  as  temperature,  humidity,  vibration,  etc.  The  likeli¬ 
hood  of  intermittents  due  to  environmental  conditions  depends  on  how  well  the  system 
is  protected  from  its  physical  environment  through  cooling,  shielding,  filtering,  etc. 
Since  intermittent  faults  are  random,  they  can  be  modeled  only  by  using  probabilistic 
methods  [Br73] . 

Two  types  of  techniques  are  used  in  order  to  prevent  temporary  faults  from  caus¬ 
ing  system  failures:  fault  masking  and  concurrent  fault  detection. 

Fault  masking,  also  known  as  static  redundancy,  uses  extra  components  so  that 
the  effect  of  a  faulty  component  is  masked  instantaneously.  Two  major  techniques 
employed  to  obtain  fault  masking  are  triple  modular  redundancy  (TMR)  and  the  use  of 
error  control  coding  (ECC).  TMR  was  originally  suggested  by  von  Neumann  [Ne56l. 

Essentially,  a  voting  element  accepts  the  outputs  of  three  sources  and  delivers  the 
majority  vote  as  its  output.  It  has  been  suggested  that  the  TMR  scheme  could  also  be  a 
practical  technique  for  designing  fault  tolerant  VLSI  chips  'Se78i.  The  use  of  error 
control  coding  will  be  discussed  in  section  1  4.1.4. 

Concurrent  fault  detection,  also  known  as  standby  redundancy,  uses  extra  com¬ 
ponents  to  generate  error  signals.  The  presence  of  a  fault  is  detected,  and  a  subsequent 
recovery  action  either  eliminates  the  fault  or  corrects  the  error.  The  principle  tech¬ 
niques  are:  replacement  of  the  faulty  element  or  system  by  a  standby  spare;  reorgani¬ 
zation  of  the  system  into  a  different  configuration  These  two  methods  presuppose  the 
existence  of  a  diagnostic  procedure  which  will  recognize  the  symptoms  Mo56],  and  of 
a  switch  which  implements  the  replacement  or  reconfiguration  F168 j  In  general, 
dynamic  redundant  systems  can  be  divided  into  two  categories  cold-standbv  system 
and  hot-standby  system.  In  a  cold-standby  system  one  module  is  powered  up  and 


operational,  the  rest  are  not  powered,  i.e. ,  they  are  cold  spares.  Replacement  of  a 
faulty  module  by  a  spare  is  effected  by  turning  off  its  power  and  powering  a  spare.  In  a 
hot-standby  system  all  the  modules  are  the  same,  the  output  of  any  arbitrarily  selected 
module  can  be  taken  as  the  system  output.  When  a  fault  is  detected  in  a  module  the 
system  is  reconfigured  so  that  the  system  output  comes  from  one  of  the  remaining 
modules  [  La85l. 

1.4. 2. 3.  Memory. 

Random  access  memory  (RAM)  has  an  access  time  independent  of  the  physical 
location  of  the  data.  There  are  various  bipolar  and  MOS  technologies  used  to  imple¬ 
ment  random  access  memories.  The  bipolar  technologies,  which  include  ECL,  TTL.  and 
PL,  are  employed  to  construct  RAM  circuits  with  resistors,  diodes,  and  bipolar  transis¬ 
tors.  The  primary  characteristics  of  bipolar  RAMs  are  high  speed,  high  power  con¬ 
sumption,  and  low  density.  MOS  RAM  ICs  are  constructed  with  N-channel  and  P- 
channel  transistors.  The  two  technologies  widely  used  are  NMOS  and  CMOS.  These 
technologies  yield  RAMs  with  improved  density  and  power  characteristics  compared  to 
bipolar  technologies.  CMOS  RAMs  offer  another  advantage  over  bipolar  RAMs,  low 
power  consumption. 

The  first  MOS  technology  used  to  manufacture  memory  devices  was  PMOS  tech¬ 
nology.  P-channel  MOSFETs  are  used  as  the  memory  cell.  The  result  was  high  den¬ 
sity,  low-cost  memories,  but  with  slow  operation  and  high  power  consumption. 

The  next  technology,  NMOS,  results  in  faster  operating  speeds  because  the  mobil¬ 
ity  of  electrons  is  higher  for  NMOS  than  PMOS  N-channel  MOSFETs  are  used  as  the 
memory  cell.  Not  only  are  high  operating  speeds  achieved  with  NMOS.  but  NMOS  also 
yields  higher-density  circuitry  than  does  PMOS  These  devices  are  still  much  slower 
than  bipolar  memories  and  do  not  overcome  the  high  power  limitation  (at  least  for 


The  problem  of  high  power  consumption  is  overcome  with  CMOS  technology. 
Both  P-channel  and  N-channel  MOSFETs  are  used  together  to  make  a  low-power 
storage  cell,  one  MOSFET  acts  as  a  switch  and  the  other  as  a  load  resistance.  Since 
both  P-channel  and  N-channel  devices  are  used,  the  cells  are  less  dense  than  NMOS 
RAM  cells.  The  resulting  memory  cell  has  very  good  power  characteristics.  Also, 
memories  made  with  CMOS  technology  enjoy  higher  noise  immunity  than  memories 
made  with  any  other  technology  [Tr82j. 

The  memory  cells  used  in  RAMs  can  be  divided  into  two  basic  categories:  static 
RAM  (SRAM)  and  dynamic  RAM  (DRAM).  SRAMs  use  some  form  of  latched  storage, 
while  DRAMs  use  dynamic  storage  by  charging  a  capacitor. 

Increasing  demands  for  lower  cost,  higher  bit  density,  and  better  performance  have 
caused  a  proliferation  of  new  semiconductor  memories.  Currently,  off-the-shelf  256k-bit 
memories  have  access  times  of  approximately  40  nanoseconds  [Ya84] 

14.2.4.  Error  Detection  and  Correction  Code.  In  1948,  Shannon 
demonstrated  that,  by  proper  encoding  of  the  information,  errors  induced  by  a  noisy  channel  or 
storage  medium  can  be  reduced  to  any  desired  level  without  sacrificing  the  rate  of  information 
transmission  or  storage.  A  large  amount  of  effort  has  been  expended  on  the  problem  of  devising 
efficient  encoding  and  decoding  methods  for  error  control  in  a  noisy  environment.  The  use  of 
coding  for  error  control  has  become  an  integral  part  in  the  design  of  modern  communication  sys¬ 
tems  and  digital  computers.  The  transmission  and  storage  of  digital  information  have  many 
similarities.  Both  transfer  data  from  an  information  source  to  a  destination  A  typical  storage 
(transmission)  system  may  be  represented  by  Figure  1-2. 

The  encoder  transforms  the  information  sequence  l  into  a  discrete  encoded 
sequence  V  called  a  code  word  This  waveform  enters  the  storage  medium  and  is  cor¬ 
rupted  by  noise.  The  decoder  processes  the  received  sequence  R  The  decoder 
transforms  the  received  sequence  R  into  a  binary  sequence  1  called  the  estimated 
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Figure  1-2  TYPICAL  STORAGE  SYSTEM 


sequence.  The  decoding  strategy  is  based  on  the  rules  of  channel  encoding  and  the 
noise  characteristics  of  the  storage  medium.  Ideally,  U’  will  be  a  replica  of  the  informa¬ 
tion  sequence  U,  although  the  noise  may  cause  some  decoding  errors. 

There  are  two  different  types  of  codes  in  common  use  today,  block  codes  and  con¬ 
volution  codes.  The  encoder  for  a  convolution  code  accepts  k-bit  blocks  of  the  informa¬ 
tion  sequence  U  and  produces  an  encoded  sequence  V  of  n-symbol  blocks.  In  convolu¬ 
tional  coding,  the  symbol  U  and  V  are  used  to  denote  sequences  of  blocks  rather  than  a 
single  block.  However,  each  encoded  block  depends  not  only  on  the  corresponding  k-bit 
message  block  at  the  same  time  unit,  but  also  on  m  previous  message  blocks.  Therefore, 
the  encoder  has  a  memory  order  of  m. 

The  encoder  for  a  block  code  divides  the  information  sequence  into  message  blocks 


of  k  information  bits  each  In  block  coding,  the  symbol  l  is  used  to  denote  a  k-bit 


message  rather  than  the  entire  information  sequence.  Since  the  n-symbol  output  code 
word  depends  only  on  the  corresponding  k-bit  input  message,  the  encoder  is  memorv- 
less. 

1.5.  Approach 

The  responsibilities  for  further  refinement  and  development  of  a  fault  tolerant 
PFA  pipeline  were  distributed  between  the  three  thesis  efforts  presented  previously. 
The  macro  cells  of  the  WFTA16  were  to  be  integrated  with  testability  incorporated 
into  the  design.  The  development  of  a  fault  tolerant  PFA  pipeline  processor  was  to  be 
completed  by:  specifying  the  architecture  of  the  PFA  controller;  designing  a  pipeline 
memory  bank  with  ECC;  and  prototyping  the  ECC  memory  chip  in  three  micrometer 
CMOS.  The  WFT  modules  were  to  be  described  in  VHDL. 

1.6.  Sequence  of  the  Presentation 

Chapter  2  examines  the  requirements  for  control  of  the  PFA  pipeline,  memory 
capacity  and  control,  and  suggests  alternative  approaches  to  satisfy  those  requirements. 

Chapter  3  presents  an  architecture  and  algorithm  to  best  satisfy  the  requirements 
presented  in  chapter  2. 

Chapter  -4  presents  aspects  of  the  VLSI  implementation  of  the  circuits  described  in 
chapter  3,  block  diagrams  and  some  unique  cells. 

Chapter  5  offers  recommendations  and  conclusions 


CHAPTER  2 


DETAILED  ANALYSIS  OF  THE  PROBLEM 


2.1.  Introduction 

As  stated  previously,  this  thesis  effort  encompasses  the  design  of  a  controller  for 
the  Prime  Factor  Algorithm  (PFA)  pipeline  and  the  design  of  dual  ported,  double 
buffered  high  speed  RAM  memory  with  error  detection  and  correction  circuitry  for  the 
pipeline. 

2.2.  RISC  PFA  Controller 

The  control  tasks  for  the  PFA  pipeline  can  be  divided  into  two  major  sections. 
One  section  is  concerned  with  the  sequencing  of  the  elements  of  the  pipeline,  this  sec¬ 
tion  is  called  data  flow  control.  The  other  section  is  concerned  with  the  fault  tolerance 
operations,  this  section  is  called  configuration  control. 

2.2.1.  Requirements  for  Data  Flow  Control. 

The  4080  point  PFA  system  uses  independent  Winograd  processors  of  length  16, 
15,  and  17.  DFTs  with  block  lengths  of  240,  255,  and  272  points  can  be  implemented 
with  the  Winograd  modules  of  length  15  and  16,  15  and  17.  and  16  and  17.  respec¬ 
tively.  Since  the  Winograd  processors  operate  autonomously  they  can  be  pipelined. 
The  layout  of  the  4080-point  PFA  pipeline  was  presented  in  Figure  1-1  There  are 
eight  separate  chips:  three  Winograd  processors,  four  dual-port  memories,  and  the  PFA 
controller  The  dual-port  memories  must  allow  one  processor  (or  host)  to  write  to  half 
of  the  memory  while  another  processor  reads  from  the  other  half  of  the  memory  The 
PFA  controller  provides  overall  control  of  the  PFA  pipeline  Asynchronous  control  sig¬ 
nals  to  the  memories,  to  and  from  the  Winograd  processors,  and  to  and  from  the  host 


must  flow  through  the  PFA  controller.  The  only  exception  to  this  rule  is  the  write 
(read)  strobe  generated  by  the  Winograd  processors  or  by  the  host  which  is  sent 
directly  to  the  memories. 

The  PFA  controller  operates  autonomously,  communicating  with  the  host  only  at  the 
beginning  and  end  of  each  DFT.  The  interface  between  the  PFA  controller  and  each  of 
the  Winograd  processors  is  similar.  The  Winograd  processors  also  operate  auto¬ 
nomously  and  communicate  with  the  PFA  controller  only  at  the  beginning  and  end  of 
each  16,  15,  17-point  DFT.  The  only  control  the  PFA  controller  exerts  on  the 
memories  is  to  switch  the  status  of  the  left  and  right  halves  of  the  memories. 

The  operation  of  the  PFA  pipeline  can  be  explained  by  stating  that  even  though 
data  flows  serially  through  the  pipeline  there  is  a  small  sequence  of  operations  that  are 
performed  in  parallel  for  similar  modules. 

There  are  four  PFAC-memory  interfaces.  Also,  there  are  five  PFA C- processor 
interfaces.  Three  of  the  P FAC- processor  interfaces  are  the  interfaces  to  the  Winograd 
processors.  The  other  two  PFAC-processor  interfaces  are  the  input  host  to  the  PFAC 
and  the  output  host  to  the  PFAC.  The  input  host  and  the  output  host  can  be  the 
same  entity  but  this  is  not  required.  As  a  matter  of  fact,  the  two  functions  probably 
won't  be  accomplished  by  the  same  entity.  The  input  data  will  probably  be  input  by  a 
host  with  limited  analytical  capabilities.  Whereas,  the  output  data  will  be  accepted  by 
a  host  with  the  ability  to  analyze  the  data  and  then  make  decisions  concerning  the  vali¬ 
dity  of  the  data  and  the  health  of  the  PFA  pipeline 

The  PFAC-memory  interface  consists  of  one  signal.  RlCMT  I.KFT  (R  1.)  When 
R  L  is  high,  data  is  read  from  the  right  side  of  memory  and  written  i<>  the  left  -ude  .  >f 
memory 

All  of  the  PFAC- processor  interfaces  have  OPKRATK  (OPR)  and  DONK  -icnals 
The  PF AC- processor  interfaces  for  the  Winograd  processors  have  additional  signals  for 
processing  data.  These  signals  are  used  to  pass  the  block  floating  point  exponent  to  the 


Winograd  processors  (SFO,  SFl,  SF2)  and  to  receive  the  block  floating  point  exponent 
from  the  Winograd  processors  (SOO,  SOI,  S02).  The  PFAC'  must  accumulate  the  total 
block  floating  point  exponent  for  each  transform  computed. 

All  four  of  the  PFAC-memory  interfaces  will  operate  in  a  parallel  lock-  step 
fashion  as  will  all  five  of  the  PFAC- processor  interfaces. 

Initially,  the  R/L  signal  is  high.  All  memories  will  receive  the  same  R/L  signal 
simultaneously.  Next,  the  PFAC  must  pass  the  magnitude  of  the  input  transform  data 
(SFO,  SFl,  SF2).  Then,  a  master  OPERATE  (OPR)  will  be  transmitted  to  all  PFAC- 
processor  interfaces  simultaneously.  The  PFAC  will  monitor  the  DONE  lines  for  all 
five  PFAC- processor  interfaces.  The  PFAC  waits  until  all  five  DONE  signals  are  high, 
then  the  PFAC  latches  the  magnitude  of  the  scaling  exponent  output  by  each  Wino¬ 
grad  processors.  The  PFAC  then  lowers  OPERATE.  While  OPERATE  is  low,  the 
PFAC  accumulates  the  magnitude  of  the  output  transform  data  for  each  problem  in 
the  pipeline  and  switches  the  R/L  signal. 

2.2.2.  Requirements  for  Configuration  Control. 

The  configuration  control  section  of  the  PFAC  is  concerned  with  fault  detection, 
fault  correction,  and  system  recovery  The  PFAC  must  keep  track  of  the  number  of 
errors  reported  by  each  element  of  the  PFA  pipeline  and  report  these  errors  to  the  host 
The  memories  have  on-chip  error-correcting  codes  to  provide  fault  detection  capability. 
Errors  that  are  detected  and  corrected  are  reported  to  the  PFAC  using  the  ERROR 
CONTROL  CODE  CORRECTED  (ECCC)  signal.  Errors  that  are  detected  and 
uncorrected  are  reported  to  the  PfAC  using  the  ERROR  CONTROL  CODE 
UNCORRECTED  (ECCU)  signal. 

If  a  memory  chip  should  suffer  catastrophic  failure,  the  on-chip  ECO  would  be 
useless.  However,  parity  checking  on  the  Winograd  chips  will  detect  this  condition  and 
report  to  the  PFAC  using  the  signal  PARITY  ERROR  (PE).  Odd  parity  is  used  so 


the  "stuck  at  zero"  or  "stuck  at  one"  states  (caused  by  memory  power  failure)  may  be 
detected. 

The  pipeline  can  be  configured  as  shown  in  Figure  2-1.  In  the  absence  of  faults 
the  PFA  pipeline  can  operate  correctly  as  shown  in  Figure  1-1  However,  digital  sys¬ 
tems  do  not  operate  forever  without  developing  some  faults.  The  configuration  shown 
in  Figure  2-1  employs  concurrent  fault  detection.  The  extra  components  generate  error 
signals.  The  Winograd  processors  can  operate  normally  (active  mode)  or  in  the  watch¬ 
dog  mode.  The  PFAC  programs  each  active  and  watchdog  by  setting  or  resetting  a 
watchdog  mode  flip-flop  on  each  Winograd  chip. 
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The  particular  configuration  shown  is  three  active  processors  in  series  to  form  the 
basic  PFA  pipeline,  with  two  watchdog  processors  in  parallel  with  each  active  proces¬ 
sor.  In  theory  as  many  watchdogs  as  desired  can  be  placed  in  parallel.  The  present 
PFA  controller  only  allows  two  Watchdogs  due  to  pin  limitations.  Each  watchdog  pro¬ 
cessor  computes  the  same  DFT  as  its  corresponding  active  processor.  If  the  DFT 
results  of  the  watchdog  processor  do  not  agree  with  those  computed  by  the  active  pro¬ 
cessor,  the  watchdog  processor  reports  a  WATCHDOG  ERROR  (WE)  to  the  PFAC. 

The  PFAC  must  decide  if  the  active  processor  is  in  error  or  if  a  watchdog  proces¬ 
sor  is  in  error.  Since  there  are  two  watchdog  processors,  in  this  case,  the  PFAC  can  use 
voting  to  assign  errors.  If  the  error  is  assigned  to  the  watchdog  processor  then  the 
PFAC  simply  keeps  track  of  the  error  and  reports  the  error  to  the  host.  If  the  error  is 
assigned  to  an  active  processor  then  the  PFAC  must  still  keep  track  of  the  error  and 
report  to  the  host  but  must  also  report  to  the  host  that  the  data  may  be  erroneous,  bv 
turning  off  the  DATA  VALID  bit  in  the  problem  status  word. 

Whenever  an  active  processor  commits  an  error  the  PFAC  must  reconfigure  that 
element  of  the  pipeline  ( i.e. ,  decide  which  of  the  watchdog  processors  to  make  the  active 
processor).  If  the  number  of  errors  assigned  to  the  different  processors  are  not  equal 
then  the  current  watchdog  processor  with  the  smallest  number  of  errors  is  programmed 
as  the  new  active  processor.  After  an  error  the  active  must  become  a  watchdog  unless 
the  other  two  are  marked  as  "dea.i”  in  the  health  register  The  status  of  the  PFA  ele¬ 
ments  must  be  placed  on  the  data  lines  of  the  output  by  the  PFAC  whenever  the  host 
inquires  about  status.  The  host  can  control  the  configuration  of  the  pipeline  by  sending 
in  a  configuration  word  The  host  can  interrupt  operations  to  check  status  of  the  pipe¬ 
line  and  to  input  limited  instructions  regarding  the  configuration  of  the  pipeline  and 
error  reporting. 

The  sequence  of  operations  necessary  to  control  the  flow  of  data 
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2.2.3.  Approaches  and  Trade-Offs. 

The  PFAC  is  a  sequential  state  machine.  A  method  for  implementing  control  cir¬ 
cuitry  for  a  sequential  state  machine  is  needed.  There  are  a  number  of  design  methods 
available  to  implement  a  sequential  control  circuit.  The  most  popular  of  these  control 
methods  include  custom  logic  implementations,  gate  array  implementations,  programm¬ 
able  logic  arrays  (PLA)  or  ROMS. 

Custom  logic  implementations  are  an  interconnection  of  flip-flops  and  gates.  The 
gates  by  themselves  constitute  a  combinational  circuit  but  when  included  with  the  flip- 
flops,  the  overall  circuit  is  classified  as  a  sequential  circuit.  The  design  procedure  con¬ 
sists  of  translating  the  circuit  specification  into  state  diagrams,  state  tables,  and  excita¬ 
tion  tables  [Ma79j. 

Gates  arrays  are  fixed  arrays  of  identical  logic  circuits.  Designing  with  gate  arrays 
consists  of  specifying  the  wiring  connections  between  the  given  logic  elements  in  the 
array  jOn84], 

ROMs  can  be  used  in  the  design  of  sequential  circuits.  A  ROM  generates  an 
input-output  relation  specified  by  a  truth  table.  As  such,  a  ROM  can  implement  any 
combinational  logic  circuit  with  k  inputs  and  n  outputs. 

A  programmable  logic  array  (PLA)  is  similar  in  concept  to  a  ROM  except  that  a 
PLA  does  not  provide  full  decoding  of  the  input  lines. 

ROMs  can  also  be  used  to  store  coded  information  that  represents  the  sequence  of 
internal  control  variables  needed  for  the  operation  of  a  sequential  state  machine  When 
a  ROM  is  used  to  store  binary  control  information,  it  is  called  microprogram  rouble  con¬ 
trol  ;Ma82).  The  control  unit  initiates  a  series  of  sequential  steps  ol  micro  operations 
Changes  can  be  made  to  a  microprogrammed  control  without  changing  the  hardware 
by  just  modifying  the  sequence  of  micro-operations 


The  PFAC  is  divided  in  function  between  data  flow  control  and  configuration  con¬ 
trol.  The  PFAC  data  flow  control  section  requires  limited  programmability.  Whereas, 
the  configuration  control  section  needs  to  have  a  greater  amount  of  programmability 
to  store  the  detection/reconfiguration  algorithms. 

2.3.  Memory  Capacity  and  Control 

2.3.1.  Requirements. 

With  today’s  1.2  micron  CMOS  technology,  all  the  memory  control  and  arithmetic 
circuitry  for  the  WFT  processor  can  be  placed  on  the  chip  [Sh86j.  However,  the  data 
generated  by  each  WFT  processor  cannot  be  stored  on  the  same  chip  Coutee  1985 
Therefore,  separate  chips  are  required  for  data  storage.  For  a  4080-point  transform  the 
memory  must  be  organized  into  two  sides  of  4080  48-bit  words  (24  bits  for  real  and  24 
bits  for  imaginary  data,  total  number  of  bits  =  391,680).  The  memory  must  be  dual- 
ported  and  double  buffered.  A  double  buffered  memory  enables  100%  utilization  of  the 
memory.  One  port  will  always  read  and  the  other  port  will  always  write.  Each  port 
will  be  connected  to  only  one  side  of  the  memory  at  a  time.  The  memory  must  accept 
a  signal  RIGHT/LEFT  (R/L)  to  switch  the  connection  of  the  ports  to  one  side  of 
memory  or  the  other  side  of  memory.  Each  port  has  a  WRITE  (READ)-STROBE  (WS) 
associated  with  its  operation.  The  memory  must  perform  a  write  (read)  operation  in 
response  to  this  signal.  The  access  time  for  the  data  memories  must  be  less  than  28 
nanoseconds. 

The  parity  checking  of  the  Winograd  processors  can  detect  catastrophic  failure  of 
a  memory  chip.  Error  control  code  (ECC)  circuitry  must  be  placed  on  the  memory  chip 
to  minimize  the  effect  of  transient  faults  on  the  data  The  ECC  should  correct  all  sin¬ 
gle  bit  error  patterns  and  detect  all  two  bit  error  patterns  In  addition,  the  code  should 
correct  as  many  double  bit  errors  as  possible. 


2.3.2.  Approaches  and  Trade-Offs. 


2.3.2. 1.  Static  RAM  /  Dynamic  RAM  . 

As  mentioned  in  chapter  1,  the  two  categories  of  RAM  memory  cells  are  static  and 
dynamic.  A  six  transistor  SRAM  cell  is  shown  in  Figure  2-3. 

A  four  transistor  SRAM  cell  can  be  obtained  by  replacing  the  p  transistors  of  the 
static  cell  with  resistors,  as  shown  in  Figure  2-4.  A  DRAM  can  ge  obtained  by  eliminat¬ 
ing  the  pullups  altogether.  Although  the  basic  DRAM  cell  is  smaller  than  the  SRAM 
cell,  the  DRAM  cell  has  to  be  refreshed  to  retain  the  contents  of  the  cell.  However, 
when  the  pipeline  is  operating  at  normal  operating  speed  the  amount  of  time  between 
the  write  and  read  operation  for  a  memory  location  is  less  than  the  amount  of  time  for 
a  refresh  cycle.  The  normal  operation  of  the  memory  would  refresh  the  memory  cells. 


so  that  separate  refresh  circuitry  would  be  unnecessary 

SRAMs  and  DRAMs  can  be  designed  with  less  transistors  per  memory  cell  How¬ 
ever,  these  denser  designs  require  control  or  at  least  knowledge  of  the  processes  that  will 
be  used  to  fabricate  the  memories.  SRAMs  are  easier  to  design  and  are  potentially  less 
troublesome  than  DRAMs.  SRAMs  tend  to  be  faster  (but  much  larger)  than  DRAMs 
(Weste  and  Eshragian,  1985). 

2  3.2.2.  Memory  Partioning  Options. 

Each  Winograd  processor  outputs  two  2-4-hit  data  words  in  parallel  One  21- bit 
data  word  is  for  the  real  data  and  the  other  is  for  the  imaginary  data.  The  data  is  sent 
to  the  memory  on  two  parallel  24- bit  busses.  Although  identical  parallel  operations  are 
performed  on  the  real  data  and  the  imaginary  data,  the  real  data  and  the  imaginary 
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data  can  be  treated  separately.  Therefore,  the  memory  can  be  partioned  between  real 
data  and  imaginary  data  as  shown  in  Figure  2-5  Also,  the  memory  read  and  write 
operations  are  mutually  exclusive.  Therefore,  the  memory  can  be  partioned  between 
read  and  write  operations  as  shown  in  Figure  2-6 


When  memory  is  partitioned  between  read  and  write  operations,  there  are  only 
two  address  busses,  the  address  busses  are  shorter  and  the  data  busses  are  longer. 


** 


With  this  approach,  the  address  can  be  selected  before  the  data  arrives. 

2. 3. 2. 3.  Error  Detection  and  Correction. 

As  stated  in  chapter  1,  by  proper  encoding  of  information,  errors  induced  by  a 
noisy  storage  medium  can  be  reduced  to  any  desired  level  without  sacrificing  the  rate  of 
information  storage.  There  are  two  different  types  of  codes  in  common  use  today,  block 
codes  and  convolution  codes.  Since  the  encoder  of  a  convolution  code  has  a  memory, 
sequential  logic  circuits  are  used  to  implement  the  convolution  code  encoder  However, 
the  encoder  for  a  block  code  does  not  have  a  memory.  Combinational  logic  circuits  are 
used  to  implement  a  block  code  encoder.  Since,  combinational  logic  circuits  are,  gen¬ 
erally,  easier  to  implement,  convolution  codes  will  not  be  discussed  further. 

The  encoder  for  a  block  code  divides  the  information  sequence  into  message  blocks 
of  "k"  information  bits  each.  A  message  block  is  represented  by  the  binary  k-tuple  U 
—  (U 1 ,  U2,  ....  Uk)  called  a  message.  In  block  coding,  the  symbol  "l"'  is  used  to  denote 
a  k-bit  message  rather  than  the  entire  information  sequence  There  are  a  total  of  2* 
different  possible  messages.  The  encoder  transforms  each  message  "u"  independently 

into  an  n-tuple  "V”  -  ( v  1 ,  v2 . vn)  of  discrete  symbols  called  a  cole  word  ’Y" 

denotes  an  n-symbol  block  rather  than  the  entire  encode  sequence  There  are  2* 
different  possible  code  words  at  the  encoder  output  corresponding  to  the  2*  different 
possible  messages. 

In  a  binary  code,  each  code  word  "V"  is  also  binary.  Since  for  a  binary  code  to  be 
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code  rate  R,  more  redundant  bits  can  be  added  by  increasing  the  block  length  n  of  the 
code  while  holding  the  ratio  k  n  constant. 

In  a  block-coded  system  the  decoder  output  "U”'  represents  the  k-bit  estimate  of 
the  encoded  message  The  decoder  must  produce  an  estimate  "l”“  of  the  information 
sequence  ”1  "  based  on  the  received  sequence  "R"  For  ease  of  rode  synthesis  and  imple¬ 
mentation,  attention  should  be  restricted  to  a  subclass  of  all  the  block  codes,  the  linear 
block  codes  Linear  block  codes  are  defined  and  described  in  terms  of  generator  and 
parity-check  matrices  By  definition,  a  block  code  of  length  n  and  2*  code  words  is 
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called  a  linear  (n,k)  code  if  and  only  if  its  2*  code  words  form  a  k-dimensional  subspace 
of  the  vector  space  of  all  the  n-tuples  over  the  Galois  field  GF(2). 

Any  k  linearly  independent  code  words  of  an  (n,k)  linear  code  can  be  used  to  form 
a  generator  matrix  "G"  for  the  code.  Since  an  (n,k)  linear  code  is  completely  specified 
by  the  k  rows  of  a  generator  matrix  "G",  the  encoder  has  only  to  store  the  k  rows  of 
"G"  and  to  form  a  linear  combination  of  these  k  rows  based  on  the  input  message  ”U”. 

A  desirable  property  for  a  linear  block  code  to  possess  is  a  systematic  structure, 
where  a  code  word  is  divided  into  two  parts,  the  message  part  and  the  redundant 
checking  part.  The  message  part  consists  of  k  unaltered  information  digits  and  the 
redundant  checking  part  consists  of  n-k  parity-  check  digits,  which  are  linear  sums  of 
the  information  bits.  A  linear  block  code  with  this  structure  is  referred  to  as  a  linear 
systematic  code.  A  linear  systematic  (n,k)  code  is  completely  specified  by  a  k  x  n 
matrix  "G"  shown  in  Figure  2-7,  where  Pij=0  or  1.  "Ik"  denotes  the  k  x  k  identity 
matrix.  Then  "G"  =  ("P"  *  ’Ik”).  Furthermore,  there  exists  an  (n-k)  x  n  matrix  H 
(parity-check  matrix  )  such  that  an  n-tuple  "V"  is  a  code  word  if  and  only  if  "V"  *  "Ht" 
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Then  "S"  =  0  if  and  only  if  "R"  is  a  code  word,  and  "S"  not  equal  0  if  and  only  if 
"R"  is  not  a  code  word.  It  is  possible  that  the  errors  in  certain  error  vectors  are  not 
detectable  (i.e.,  "R"  contains  errors  but  "S"  =  "R"  *  "Ht"  =  0).  This  happens  when  the 
error  pattern  "E"  is  identical  to  a  nonzero  code  word.  Erro.  patterns  of  this  kind  are 
called  undetectable  error  patterns.  When  an  undetectable  error  pattern  occurs,  the 
decoder  makes  a  decoding  error.  The  syndrome  "S"  computed  from  the  received  vector 
"R”  actually  depends  only  on  the  error  pattern  "E"  and  not  on  the  transmitted  code 
word  "V". 

S  =  R  *  Ht  =  (V  -  E)  Ht  =  V  *  Ht  -  E  *  Ht 
V  *  IU  =  0 

S  =  E  *  Ht 

Determining  the  true  error  vector  "E"  is  not  a  simple  matter.  There  are  2*  error 
patterns  that  result  in  the  same  syndrome,  and  the  true  error  pattern  "E"  is  just  one  of 
them.  To  minimize  the  probability  of  a  decoding  error,  the  most  probable  error  pattern 
is  the  one  that  has  the  smallest  number  of  nonzero  digits 

The  decoding  scheme  is  called  syndrome  decoding  or  table-lookup  functions 

EO  FOfSO.Sl . Sn-k-1), 

El  -=  Fl(S0,Sl . Sn-k-1). 

* 

* 

En-1  -  Fn-l(S0,Sl.  , Sn-k-1), 

where  SO,  Si . Sn-k-1  are  the  syndrome  bits,  which  are  regarded  as  switching  vari¬ 
ables,  and  EO.  El . En-I  are  the  estimated  error  digits  When  these  n  switching 

functions  are  derived  and  simplified,  a  combinational  logic  circuit  with  the  n-k  syn- 
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drome  digits  as  inputs  and  the  estimated  error  digits  as  outputs  can  lie  obtained  The 
switching  expression  for  the  error  digits  is  the  ANDing  of  the  syndrome  digits  (or  the 
complement  of  the  syndrome  digits).  A  conceptual  representation  of  the  decoder  is 
shown  in  Figure  2-H  Krmr  control  codes  are  discussed  m  detail  by  I. in  and  Costello 
LiSd 
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CHAPTER  3 


ARCHITECTURE  AND  ALGORITHMS 

3.1.  PFA  Controller 

In  chapter  2,  several  design  methods  were  presented  for  implementing  control  cir¬ 
cuitry  for  the  PFA  controller  (PFAC).  These  design  methods  included  custom  logic 
implementations,  gate  array  implementations,  and  programmable  logic  arrays  (PLA)  or 
ROMs.  Each  method  has  its  advantages  and  disadvantages.  The  goal  is  to  use  the 
implementation  that  will  provide  the  PFAC  with  flexibility,  extensibility  of  the  design, 
and  adequate  speed  of  operation.  The  ROM  implementation  possesses  an  advantage 
over  the  other  methods  in  flexibility  and  extensibility  of'the  design.  Although  the  PFA 
pipeline  processors  operate  at  clock  rates  of  over  60  MHz,  the  PFAC  can  operate  with 
clock  rates  of  20  MHz  and  still  keep  the  pipeline  operating  with  results  every  120 
microseconds.  The  PFAC  design,  as  shown  in  Figure  3-1,  is  a  Reduced  Instruction  Set 
Computer  (RISC). 

The  RISC  architecture  is  simple  both  in  the  instruction  set  and  the  hardware 
needed  to  implement  that  instruction  set.  The  thrust  of  the  RISC  design  is  towards 
efficient  implementation  of  a  straightforward  instruction  set.  The  major  components  of 
the  design  are  a  processor  unit  and  a  microprogrammed  control  unit. 

3.1.1.  Processor  Unit.  The  processor  unit  is  shown  in  Figure  3-2.  The  proces¬ 
sor  unit  consists  of  three  busses,  an  arithmetic  logic  unit.  (AH'),  twenty-eight  processor 
registers,  and  bus  selection  circuitry.  The  three  busses  are  16  bits  wide  The  A  and  B 
busses  are  connected  to  the  outputs  of  the  registers  The  B  bus  is  also  connected  to  16 
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inputs  of  the  branch-test  multiplexer  in  the  microprogrammed  control  unit  The  A  and 
B  busses  form  the  inputs  to  the  AH  The  function  selection  lines  select  the  arithmetic 
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Figure  3-2  PROCESSOR  UNIT 


3.1.2.  MICROPROGRAMMED  CONTROL  UNIT. 

The  control  unit  performs  a  series  of  sequential  micro-operations. 

The  control  memory  operates  as  a  combinational  circuit  with  the  address  as  the  input 
and  the  corresponding  word  as  the  output.  The  word  read  from  the  control  memory  is 
the  control  word.  The  control  word  represents  a  microinstruction.  The  microinstruc¬ 
tion  has  bits  that  specify  one  or  more  microoperations  and  also  has  bits  that  determine 
the  next  address  of  the  control  memory  The  bits  of  the  microinstruction  are  divided 
into  parts  called  fields.  Each  field  defines  a  distinct,  separable  function 


The  CPU  is  organized  around  a  bus  system.  A  bus  organized  processor  can  be 
controlled  with  vertical  microinstructions.  The  decoders  for  the  various  fields  are 
placed  in  the  bus  system.  The  microinstruction  that  controls  this  bus  organized  proces¬ 
sor  unit  can  be  divided  into  four  fields,  with  each  field  supplying  the  control  bits  to 
select  the  function  in  each  component.  The  control  word  format  for  controlling  the  bus 
system  is  shown  in  Figure  3-3.  Two  fields  A  and  B  specify  the  source  registers  for  the 
A  and  B  input  busses  to  the  ALU.  The  ALU  field  specifies  one  of  12  ALU  function. 
The  D  field  designates  the  destination  register  for  storing  the  results  of  the  ALU  opera¬ 
tions.  Since  there  are  twenty-eight  registers,  the  A,  B,  and  D  fields  must  contain  five 


Therefore,  the  ALU  field  contains  four  bits,  The  SPEC  field  uses  a  horizontal  format. 
Each  bit  of  the  field  is  used  to  generate  a  control  signal.  The  specific  names  of  these 
bits  and  the  functions  of  the  specific  control  signals  will  be  discussed  later  in  this 
chapter. 

The  rest  of  the  fields  of  microinstruction  are  used  by  the  address  sequencer  to 
determine  the  address  of  the  next  microinstruction  to  be  executed.  The  CD  field  selects 
one  of  the  status  bits  in  the  branch-test  multiplexers  as  the  condition  bit.  Since  there 
are  thirty-two  inputs  to  the  branch-test  multiplexer,  the  CD  field  has  five  bits  The  BR 
field  is  three  bits  wide  and  specifies  one  of  five  operations  for  the  sequencer.  The  ADF 
field  is  nine  bits  long  since  the  control  memory  has  256  words 

The  microprogrammed  control  unit  consists  of  two  parts:  the  control  memory 
that  stores  the  microinstructions  and  the  associated  circuits  that  control  the  generation 
of  the  next  address.  The  address  generation  part  is  called  a  microprogram  sequencer 
The  block  diagram  of  the  microprogrammed  control  unit  is  shown  in  Figure  3-1  The 
purpose  of  a  microprogram  sequencer  is  to  present  an  address  to  the  control  memory  so 
that  a  microinstruction  may  be  read  and  executed  The  next  address  logic  of  the 
sequencer  determines  the  specific  address  to  lie  loaded  into  the  control  address  register 
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(CAR).  The  choice  of  the  address  source  is  guided  by  the  next  address  information  bits 
that  the  sequencer  receives  from  the  present  microinstruction.  The  CAR  sends  the 
selected  address  to  the  control  memory  and  to  the  incrementer.  The  output  of  the 
incrementer  is  applied  to  one  of  the  address  multiplexer  inputs  and  to  the  subroutine 
return  register  (SBR).  The  SBR  is  a  last-in,  first-out  stack.  The  output  from  SBR  is 
applied  to  another  of  the  address  multiplexer  inputs.  The  two  other  inputs  to  the 
address  multiplexer  come  from  the  address  field  of  the  present  microinstruction  and 
from  the  external  address  circuit.  The  five  condition  field  bits  select  one  of  32  status 
bits  in  the  branch-test  multiplexer. 

The  branch-test  multiplexer  tests  the  value  of  the  selected  bit  and  the  result  is 
applied  to  the  branch  logic  circuit.  The  branch  logic  circuit  has  four  inputs  and  four 
outputs.  The  outputs  of  the  branch  logic  select  a  sequencer  operation  and 
increment/decrement  the  stack  pointer  of  the  SBR.  The  truth  table  for  the  branch 
logic  circuit  is  shown  below. 

12  II  10  T  SI  SO  OPERATION  NOTES 


X  0  0  X  0  0  CAR<-EAD 

X  0  I  X  0  1  CAR<-SBR 

X  1  1  X  1  0  CAR<-CAR+1 

0  110  10  CAR<-CAR+1 

0  11111  CARc-ADF 

1110  10  CAR<-CAR+1 


BRANCH  TO  EXT  ADDR 
RETURN  FROM  SUBR 
INCREMENT  ADDR 
INCREMENT  ADDR 
BRANCH  TO  ADF  ADDR 
INCREMENT  ADDR 


111111  CARc-ADF.SBRc-CAR^l  BRANCH  TO  ADF  ADDR 


T  and  12  are  don’t  cares  unless  (11  10)  ~  (1  1).  For  (11  10)  =  (0  0),  an 
external  address  is  selected.  For  (11  10)  —  (0  1),  the  address  comes  from  the  SBR 
(return  from  subroutine).  For  (11  10)  =  (1  0),  the  address  comes  from  the  incrementer 
When  (12  11  10)  =  (0  1  1),  the  sequencer  executes  a  conditional  branch  if  T  —  1.  other¬ 
wise  the  CAR  is  incremented.  When  (12  II  10)  =  (I  I  i),  the  sequencer  executes  a  con¬ 
ditional  subroutine  call.  A  conditional  branch  and  a  conditional  subroutine  call  both 
select  the  address  from  the  ADF  field  of  the  microinstruction.  However,  the  subroutine 
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call  stores  the  address  from  the  incrementer  in  SBR.  This  is  the  address  next  in 
sequence,  and  it  is  called  the  return  address.  The  return  is  transferred  to  the  CAR 
with  a  subroutine  return  operation. 


The  simplified  Boolean  functions  for  the  branch  logic  circuit  are  : 

Si  =  II 

SO  =  10  *  (T  +  IT) 

PUSH  —  12  1 1  10  T 
POP  =  ir  *  IO 


3.1.3.  Data  Flow  Control. 


In  Chapter  2,  the  control  functions  of  the  PFAC  were  divided  into  data  flow  con¬ 
trol  and  configuration  control  The  signals  necessary  for  data  flow  control  were 
specified  in  Chapter  2  The  master  OPKRATK  and  RIOI1T  I.KFT  signals  are  gen- 
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erated  as  outputs  of  the  control  memory  The  DONE  signals  are  input  to  an  NAND 
gate.  The  output  of  the  NAND  gate  is  applied  to  an  input  of  the  branch-test  multi¬ 
plexer  (ALL  DONE).  The  block  floating  point  exponents  for  the  three  problems  in  the 
PFA  pipeline  are  stored  in  the  temporary  scale  register  (Latch  Scale  Temporary,  LST). 
The  block  floating  point  exponent  from  the  WFTA  processors  for  each  problem  in  the 
PFA  pipeline  is  accumulated  by  adding  the  contents  of  the  temporary  scale  register  to 
the  contents  of  the  permanent  scale  register.  Next,  the  block  floating  point  exponent  of 
the  leading  problem  is  latched  into  the  output  problem  status  register,  (Latch  Scale 
Permanent,  LSP).  Then,  the  contents  of  the  permanent  scale  register  are  shifted  right 
five  bits  (Shift  Register  Mode,  SRI,  SRO).  The  block  floating  point  exponent  which 
was  stored  in  the  temporary  scale  register  must  be  sent  to  the  WFTA  processors 
(STATE/SCALE,  SSC,  DRIVE  LOAD  SIGNALS,  DLS). 

3.1.4.  Configuration  Control. 

The  configuration  control  section,  as  stated  in  Chapter  2,  is  concerned  with  fault 
detection,  fault  correction,  and  system  recovery.  The  flow  chart  for  PFAC  operation 
with  configuration  control  is  shown  in  Figure  3-8  All  error  signals  from  the  PFA  pipe¬ 
line  modules,  are  latched  into  the  pipeline  status  register,  (Latch  Pipeline  Status,  LPS). 
The  pipeline  status  register’s  contents  are  checked  for  the  presence  of  reported  errors. 
The  contents  of  the  pipeline  status  register  are  applied  to  the  inputs  of  a  combinational 
logic  voting  circuit  to  determine  which  module  should  be  assigned  anv  errors  reported. 
If  no  errors  are  reported  then  the  PFAC  continues  operation  of  the  PFA  pipeline.  If 
there  were  errors  the  results  of  the  combinational  logic  voting  circuit  are  latched  into 
the  error  location  register  The  contents  of  the  configuration  register  are  AND'ed  with 
the  contents  of  the  error  location  register  If  the  results  are  not  equal  to  zero  then  the 
error  was  assigned  to  an  active  processor  and  the  pipeline  must  be  reconfigured  Other¬ 
wise,  the  errors  are  just  added  to  the  register  for  the  modules  that  were  assigned  an 


error 


Figure  3-4  MICROPROGRAMMED  CONTROL  UNIT 


Registers  RO  thru  RlT  store  the  error  count  for  each  module  If  the  pipeline  has  to 
be  reconfigured  then  errors  are  still  added  to  the  registers  for  the  modules  assigned  an 
error.  The  reconfiguration  process  is  specified  in  the  microprogram  as  a  sequence  of 
examining  each  group  of  bits  for  each  column  and  comparing  the  error  count  for  the 
watchdog  processors.  The  watchdog  processor  with  the  smallest  number  of  errors  is 
assigned  the  role  of  active  If  both  watchdog  processors  have  the  same  number  of 
errors  then  the  watchdog  processor  with  the  lowest  position  designator  ( row  I  is  lower 
than  row  2)  is  assigned  the  role  of  active  The  health  register  is  used  as  a  mask  ti> 
determine  if  a  processor  is  to  be  turned  off  The  health  register  is  set  hv  the  hi>st  The 


configuration  register  determines  which  processors  are  active  and  which  processors  are 
watchdogs.  The  contents  of  the  configuration  is  latched  onto  the  scale  bus, 
(STATE/SCALE,  SSC;  DRIVE  LOAD  SIGNALS,  DLS;  WATCH  DOG/ ACTIVE  LOAD. 
WAL). 

There  are  four  registers  that  record  any  errors  assigned  to  each  problem  in  the 
pipeline.  This  is  accomplished  by  latching  signal  from  column  1  into  Problem  1  Status 
Register,  from  column  2  into  Problem  2  Status  Register,  and  from  column  3  into  Prob¬ 
lem  3  Status  Register.  The  contents  of  the  problem  status  registers  are  then 
transferred  (Problem  3  Status  Register)  >  Output  problem  Status  register),  (  Problem 
2  Status  Register  >  Problem  3  Status  Register  ),  (  Problem  1  Status  Register  >  Prob¬ 
lem  2  Status  Register). 

3.1.5.  Design  for  Testability. 

Because  the  complexity  of  circuits,  boards,  and  systems  is  increasing,  testing  has 
become  more  expensive  and  more  difficult.  One  approach  to  reducing  the  difficulty  and 
cost  of  testing  is  to  consider  the  testability  of  a  circuit  early  in  the  design  cycle  and  try 
to  improve  a  circuit  testability.  Testability  is  defined  as  a  measure  of  the  ability  to  test 
easily  or  cost-effectively.  Controllability  and  observability  are  two  measures  of  testabil¬ 
ity.  Controllability  is  a  measure  of  how  easily  the  internal  logic  of  a  circuit  can  be  con¬ 
trolled  from  its  inputs.  Observability  is  a  measure  of  how  easily  the  internal  logic  of  a 
circuit  can  be  observed  at  it  outputs.  There  are  two  design  for  testability  techniques 
that  can  be  implemented  to  test  the  PFA  controller,  partitioning,  and  the  use  of  bus 
architecture  systems.  Both  approaches  use  the  divide  and  conquer  approach 

Partitioning  may  be  accomplished  by  using  degating  techniques.  With  (legating 
individual  components  are  isolated  so  that  test  patterns  can  be  input  and  output  to 
specific  components. 


The  chip  can  be  operated  in  normal  mode  or  test  mode.  A  TEST  signal  can  con¬ 
trol  the  mode  of  the  chip.  All  the  input  pads  drive  a  test  input  in  addition  to  a  func¬ 
tional 

output  and  the  output  pads  have  a  2:1  multiplexer  on  their  inputs.  By  controlling 
these  multiplexers  with  the  TEST  signal  most  of  the  pads  can  perform  a  dual  role  in 
normal  and  test  mode.  Additional  test  circuitry  can  be  isolated  from  the  normal  opera¬ 
tion  circuitry  by  the  use  of  transmission  gates.  The  registers  can  be  converted  to  shift 
registers  and  also  isolated  from  other  components.  When  the  test  signal  is  high,  test 
data  can  be  propagated  through  the  registers.  The  testing  costs  for  this  method  include 
the  area  required  for  the  signal  routing  and  for  the  multiplexers. 

The  second  design  for  testability  technique  is  facilitated  by  the  use  of  a  bus- 
structured  architecture.  The  host  can  access  all  the  registers  using  the  busses  and  iso¬ 
late  each  register  from  the  busses.  The  busses  make  it  possible  to  test  the  registers  by 
applying  test  patterns  to  each  register  separately.  When  the  chip  is  in  test  mode  the 
start  address  for  the  control  memory  can  be  incremented  to  select  control  words 
specifically  designed  to  test  CPU  circuitry.  The  testing  costs  of  this  method  include  the 
additional  memory  and  extra  bits  in  the  address  sequencing  circuits. 

3.2.  Memory 

3.2.1.  Introduction. 

Chapter  1  examined  the  various  technologies  used  to  implement  random  access 
memories  and  presented  error  control  codes  as  a  method  of  error  detection  and  control 
In  chapter  2,  the  requirements  for  memory  capacity  and  control  were  presented 
Approaches  and  trade-offs  were  also  highlighted.  Since  we  need  high  performance  ran¬ 
dom  access  memories  with  low  power  consumption  and  high  noise  immunity,  tin- 
memories  are  being  implemented  in  CMOS  technology  using  a  six  transistor  SRAM  cell 
with  a  linear  block  error  control  code  for  error  detection  and  correction  A  floor  plan  of 


a  PFA  pipeline  memory  is  shown  in  Figure  3-5.  Its  key  sections  are  the  RAM  array, 
the  read  circuitry  (pull  ups  and  sense  amplifiers),  the  write  circuitry,  the  error  detection 
and  correction  circuitry  (encoder  and  decoder),  the  address  selection  circuitry,  and  the 
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Figure  3-5  MEMORY  FLOORPLAN 


switching  circuitry  (the  switching  circuitry  is  embedded  in  various  parts  of  the 
memory).  A  block  diagram  showing  a  RAM  cell  with  associated  read  and  write  circui¬ 
try  is  shown  in  Figure  3-6. 

3.2.2.  RAM 

A  six-transistor  CMOS  static  RAM,  shown  in  Figure  2-3  is  used  for  data  storage 
for  the  PFA  pipeline  memory.  The  SRAM  was  selected  because  SRAMs  tend  to  be  fas¬ 
ter  than  DRAMS,  easier  to  design  than  DRAMS,  and  are  potentially  less  troublesome 
than  DRAMs  [We85j.  The  design  presented  here  is  a  "safe"  design  that  is  fairly  tolerant 
to  process  variations  but  it  is  not  the  fastest  that  can  be  designed.  However,  high 
speeds  can  be  achieved  in  large  arrays  because  the  output  is  a  differential  signal.  A 
differential  output  signal  eliminates  a  whole  class  of  common  mode  noise  sources  which 
enables  the  use  of  a  sense  amplifier  with  lower  noise  margin.  Lower  noise  margins  per¬ 
mit  higher  speeds.  The  circuit  consists  of  a  pair  of  cross-coupled  inverters  connected  by 
pass  transistors  to  BITLINE  and  BITLINE’.  A  pair  of  cross-coupled  inverters  are  the 
simplest  form  of  static  latch.  In  the  absence  of  noise,  a  static  latch  will  remember  a 
state  for  as  long  as  power  is  applied  to  the  circuit.  The  static  latch  has  elements  with 
power  gain  that  continually  counteract  the  effect  of  charge  leakage  at  the  drains  of  the 
pull-down  and  pass  transistors.  These  elements  with  power  gain  are  the  p-load  transis¬ 
tors. 

3.2.3.  Read  Circuitry. 

To  read  from  the  memory  cell,  the  pull-up  circuitry  is  used  with  a  sense  amplifier 
Both  Bitline  and  Bitline’  are  precharged  to  Vdd  by  turning  transistors  Pi  and  Pa  >>n 
To  ensure  that  Bitline  and  Bitline'  are  at  the  same  potential,  transistor  P3  is  turned  on 
When  P3  is  turned  on  Bitline  is  shorted,  through  P3.  '  Bitline'  These  three  transis¬ 

tors  are  only  turned  on  when  PRE  and  R  W  are  true  Since  the  memory  cell  generates 
a  differential  output  signal,  a  differential  sense  amplifier  may  be  employed  detect  the 


output  signal  A  differential  sense  amplifier  senses  :i  small  d iffererice  between  levels  ■ 
t he  Hitline  and  Hitline'  lines  and  amplifies  tins  difference  t<>  provide  very  fast  s*uism 


stage  sense  amplifier  is  discussed  in  detail  by  Soclof  [So85]. 

3.2.4.  Write  Circuitry. 

Four  transistors  (Pl,NI,  P2,  N2),  controlled  by  simple  logic  circuitry,  are  used  to 
write  to  the  memory  cell,  as  shown  in  Figure  3-7.  Writing  a  "one"  into  the  memory  cell 
is  accomplished  by  raising  the  word  line  after  precharging  Bitline  to  Vdd  and  discharg¬ 
ing  Bitline’  to  ground.  Writing  a  "zero"  into  the  memory  cell  is  the  same  as  writing  a 
"one”  except  Bitline’  is  precharged  to  Vdd  and  Bitline  is  discharged  to  ground.  The 
input  signals  gPl,  gNl,  gP2,  and  gN2  are  the  gate  signals  for  transistors  Pi,  NT,  P2. 
and  N2  respectively.  During  a  read  operation  the  write  circuitry  does  not  affect  the  bit 
lines. 

The  logic  equations  for  the  gate  signals  are  : 
gPl=(R/W’*BIT’)+R/W 
gNl=(R/W’*BIT) 
gP2=(R/W’*BIT)+R/W 
gN2=(R/W’*BIT) 

3.2.5.  Address  Selection  Circuitry. 

The  address  selection  circuitry  includes  a  NAND  PLA  and  wordline  drivers.  As 
shown  in  Figure  3-8,  the  NAND  PLA  decodes  the  9-bit  address  input  and  a  wordline 
driver  raises  the  wordline  that  corresponds  to  that  address.  The  address  input  is 
broadcast  to  every  address  in  the  PLA.  The  decoder’s  output  to  the  wordline  is  pulled 
low  when  it  is  the  selected  wordline.  The  decoder  s  output  is  input  to  the  RAM  array 
through  a  two-input  NOR  gate,  followed  by  two  inverters.  The  wordline  driver  output 
to  the  wordline  is  pulled  high  it  is  the  selected  wordline.  The  PRK  signal  is  input  t<> 
the  two-input  NOR  gate  to  disable  wordline  selection  during  precharge 

3  2.6  Error  Detection  and  Correction  Circuitry. 

The  error  detection  and  correction  circuitry  consists  of  the  encoder  and  the 
decoder.  In  chapter  2.  we  discussed  error  control  codes  were  discussed  The  linear 
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block  code  discussed  will  fulfil!  the  error  detection  :md  correction  requirements  of  single 
error  correction  and  double  error  detection  The  parity-check  matrix  I*  for  a  linear 
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block  code  is  shown  in  Figure  3-9.  Each  parity  bit  can  be  generated  with  the  circuit 
shown  in  Figure  3-10.  All  data  bits  are  only  used  twice  and  the  longest  path  is  three 
levels  of  exclusive-ORs. 


The  parity  check  portion  of  the  systematic  is: 


P  MATRIX 

VO  VI  V2  V3  V4  V5  V6  V7 


N-K  columns 


Encoder  equations  for  the  parity  bits  are: 
VO  =  U0  +  U7  +  U8  +  U14  +  U16  +  U21 

VI  =  U0  ♦  U1  +  U9  +  U1S  +  U 17  +  U22 

V2  =  til  ♦  U2  +  U8  +  U10  +  U13  +  U23 

V3  =  U2  +  U3  ♦  U9  ♦  Ull  ♦  U16  +  t’19 

V4  s  t!3  ♦  Li4  ♦  U10  ♦  U12  ♦  U17  +  L’20 

V5  =  U4  ♦  U5  ♦  Ull  +  U13  ♦  U18  +  U21 

V«  =  US  ♦  U6  ♦  U12  ♦  U14  ♦  U19  ♦  U22 

V7  =  U6  ♦  U7  ♦  U13  ♦  U15  +  U20  ♦  U23 

K  rows 

1.  40  GATES  REQUIRED 

2.  THREE  LEVELS  OF 

TWO  INPUT  EXCLUSIVE  OR* 

3-  ALL  DATA  BITS  ONLY  USED  TWICE 


Figure  3-9  PARITY  CHECK  MATRIX  E 


The  decoder  consists  of  a  syndrome  calculation  circuit  and  an  error  pattern  detec¬ 
tion  circuit.  The  matrix  for  generating  syndrome  digits  is  shown  in  Figure  3-11.  Each 
syndrome  digits  can  be  generated  with  the  circuit  shown  in  Figure  3-12.  All  the  data 
bits  are  only  used  twice  and  the  longest  path  is  three  levels  of  exclusive-ORs. 

The  matrix  for  generating  the  error  bits  is  shown  in  Figure  3-13.  The  error  bits 
are  generated  as  outputs  of  a  PLA.  Each  error  bit  is  the  product  of  the  syndrome 
digits.  For  a  "one"  in  the  matrix  AND  the  syndrome  digit.  For  a  "zero"  in  the  matrix 
AND  the  inverse  of  the  syndrome  digit.  The  PLA  has  eight  inputs  and  24  outputs  (we 
don’t  need  the  error  bits  for  the  parity  bits).  The  outputs  of  the  PLA  are  combined 
with  the  data  bits  output  directly  from  memory.  The  outputs  of  the  exclusive-ORs  are 
the  corrected  output. 

3.2.7.  Switching  circuitry. 

The  switching  circuitry  includes  circuits  for  switching  address  selection  signals  as 
well  as  switching  data  signals.  The  address  selection  line  switching  is  accomplished  at 


Figure  3-10  ENCODER  CIRCUTRY 
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the  address  input  pads.  Then'  arc  nine  input  pads  for  the  read  address  and  nine  input 
pads  for  the  write  address.  Each  input  pad  has  a  12  multiplexer  One  multiplexer  out- 


put  goes  to  the  appropriate  address  line  in  the  left  decoder  and  the  other  output  goes 
to  the  corresponding  address  line  in  the  right  decoder.  The  output  selection  is  con¬ 
trolled  by  the  RIGHT/LEFT  signal.  RIGHT/LEFT  =  1,  selects  read  right  bank  and 
write  left  bank. 

The  input  data  lines  are  not  switched.  The  input  data  is  sent  to  the  input  data 
bus.  The  input  data  bus  is  connected  to  the  inputs  of  the  encoder  and  to  the  inputs  of 
the  write  circuitry  of  the  right  and  the  write  circuitry  of  the  left  memory  bank.  How¬ 
ever,  only  the  write  circuitry  of  the  memory  bank  selected  to  write  will  utilize  the  input 
data. 

The  output  data  is  placed  on  an  output  data  bus.  The  outputs  of  the  sense 
amplifiers  are  connected  to  the  output  data  bus  with  tri-state  drivers.  The  outputs  of 
the  sense  amplifiers  of  the  memory  bank  not  selected  to  read  are  put  into  a  high 
impedance  state. 

The  signal  RIGHT/LEFT  is  used  to  derive  the  signal  READ/ WRITE.  For  the 
right  memory  bank  READ/WRITE  is  the  same  as  RIGHT  LEFT  but  for  the  left 
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E  BITS  ARE  GENERATED  BY  AND1NG  THE 
SYNDROME  BITS,  FOR  EXAMPLE: 
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1.  A  COMBINATIONAL  LOGIC  CIRCUIT 
CONSISTING  OF  24  EIGHT  INPUT  AND  GATES 
WITH  24  EXCLUSIVE-ORS 

2.  PLA  WITH  cEIGHT  INPUTS.  24  OUTPUTS 
AND  24  EXCLUSIVE-ORS 
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The  precharge  signals,  PREL  for  the  left  memory  bank  and  PRER  for  the  right 
memory  bank,  are  also  derived  from  the  write  enable  (WE)  and  read  enable  (RE). 

3.3.  Prototype  System. 

A  prototype  system  will  be  built  to  evaluate  the  performance  of  the  pipeline  com¬ 
ponents.  This  prototype  system  will  be  configured  as  shown  in  Figure  3-14.  The  sys¬ 
tem  shown  will  reduce  the  memory  requirements  and  still  permit  the  calculation  of  a 
272-point  DFT.  A  272-point  DFT  only  requires  544  64-bit  words  (272  64-bit  words  in 
each  bank).  Using  a  3-micron  fabrication  process,  the  RAM  array  would  be  11424 
microns  by  7104  microns.  The  PFAC  I/O  requirements  would  be  reduced,  enabling  the 
use  of  a  84-pin  configuration  instead  of  the  144-pin  package  for  a  4080-point  DFT. 
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CHAPTER  4 


VLSI  DESIGN 


4.1.  Method 

The  PFA  controller  was  designed  but  not  implemented  in  this  thesis  effort.  The 
design  utilized  components  that  are  available  in  the  AF1T  VLSI  design  library.  The 
initial  design  of  the  PFA  controller  has  flexibility  and  has  the  capability  for  fault  detec¬ 
tion,  isolation,  and  correction. 

The  dual-ported  double  buffered  memory  with  error  control  coding  was  specified  in 
Caltech  Intermediate  Form  (CIF).  Although  previous  design  projects  at  AFIT  were 
implemented  using  CMOS  technology,  the  design  rules  for  fabrication  of  the  basic  cells 
have  changed.  Before  any  previously  designed  microcells  could  be  used,  much  time  and 
effort  was  taken  to  design  the  critical  cells  to  minimize  the  area  used,  insure  functional¬ 
ity,  and  maximize  speed. 

4.1.1.  Tools. 

Several  Computer-Aided  Design  (CAD)  tools  were  used  in  designing  the  cells 
required  for  the  PFA  pipeline  memory. 

The  circuits  were  created  and  edited  using  Caesar  (Ousterhout,  1983).  an  interac¬ 
tive  VLSI  design  tool  that  utilizes  a  color  graphics  display  terminal  and  a  digitizing 
pad. 

Lyra  was  used  to  perform  design  rule  checking  on  the  caesar  circuits  (Arnold.  1983). 

The  SPICE  simulation  program  was  used  to  determine  if  the  data  paths  of  the  design 
operated  at  the  required  speed. 

Mextra  (Fitzpatrick,  1983)  and  Cstat,  a  CMOS  version  of  stat  (Baker,  1985) 
developed  at  AFIT  provided  information  on  unconnected  or  aliased  nodes  and 


transistors  which  could  not  affect  the  outputs  or  be  affected  by  the  inputs. 


4.1.2.  Design  Rules. 

The  circuits  were  designed  using  a  scalable  design  rule  set  allowing  for  either  a  1.2 
or  3  micron  implementation.  The  CMOS  design  rules  used  are  listed  in 
Appendix  A. 

4.2.  PFA  Controller. 

In  chapter  3,  the  high  level  design  of  the  PFA  controller  was  presented.  In  this 
chapter,  the  transistor  level  designs  of  the  cells  and  modules  are  presented. 

4.2.1.  Processor  Unit.  The  main  modules  of  the  processor  unit  are  an  ALU,  twenty- 
eight  processor  registers,  and  bus  selection  circuitry. 

The  design  of  the  ALU  is  a  combinational  logic  circuit.  The  unit  has  a  regular 
pattern  that  can  be  broken  down  into  identical  stages.  Each  stage  consists  of  a  full- 
adder  with  the  input  to  each  full-adder  specified  by  the  Boolean  functions: 

A  typical  ALU  stage  is  implemented  as  shown  in  Figure  4-1. 

The  boolean  input  functions  can  be  implemented,  using  CMOS  Inverter 
Transmission  Gate  Logic  (C1TGL),  with  a  total  of  sixteen  transistors  The  x  input  is 
implemented  with  a  2:1  multiplexer,  an  AND  gate,  and  an  OR  gate  The  Y  input  is 
implemented  with  only  a  2:1  multiplexer  The  Z  input  only  requires  a  transmission 
gate  for  implementation 

The  full-adder  is  implemented,  using  CITGL,  with  a  total  of  nineteen  transistors 
The  sum  output  is  obtained  with  an  exclusive-OR  gate  and  an  exclusive-NOR  gate 
The  carry  output  is  realized  with  two  AND  gates  and  an  OR  gate 

The  twenty-eight  processor  registers  are  implemented  with  sixteen  master-slave 
flip-flops  which  in  turn  consist  of  two  I)  flip-flops  Master-slave  flip-flops  (MSFF)  are 
required  so  that  the  new  data  can  be  arriving  at  the  input  to  the  master  while  the  old 
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Figure  4-2  MASTER-SLAVE  FLIP-FLOP 


data  is  still  stable  at  the  output  of  the  slave.  The  master-slave  flip-flop  functionality 
can  be  realized  using  CITGL  Only  sixteen  transistors  are  required  to  implement  a 
master-slave  flip-flop  as  shown  in  Figure  4-2.  Two-phase  clocking  is  used  to  guarantee 
no  overlap  between  the  two  phases  and  thus  prevent  any  possibility  of  racing  through 
the  MSFF.  All  the  registers,  with  one  exception,  are  implemented  in  this  manner  The 
one  exception  is  the  permanent  scale  register  This  register  is  a  shift  register,  with 
parallel  load,  that  does  a  five  bit  shift.  Several  of  the  registers  have  data  input  directly. 
The  data  is  gated  to  the  inputs  through  transmission  gates  All  the  registers  including 
special  circuitry  are  shown  in  Appendix  C. 

The  bus  selection  circuitry  consists  of  a  combination  of  a  32- bit  decoder  and  a  2  1 
multiplexer  as  shown  in  Figure  4-3.  The  32- bit  decoder  enables  the  selection  of  any  one 
of  32  addresses.  The  2:1  multiplexor  enables  the  host  to  specify  addresses,  during  a 
test,  instead  of  the  control  memory 


sequencer  as  shown  in  Figure  3-1. 


4.2.2. 1.  Control  Memory. 


As  previously  stated,  the  microinstructions  for  the  RISC  are  stored  in  a  control 
memory.  The  control  memory  is  implemented  with  a  very  dense  ROM  design.  This 
dense  ROM  design  is  one  that  uses  a  very  compact  XROM  cell  (Wilson  and  Schroeder, 
1978;  McKenny,  1980).  The  XROM  cell  receives  its  name  from  the  X  pattern  of  its 
transistors,  as  shown  in  Figure  4-4.  The  XROM  cell  is  capable  of  storing  four  bits  in  a 
cell  that  is  just  under  8x8  microns  square  (using  the  design  rules  in  Appendix  A  and 
fabricating  with  the  1.2  micron  process).  A  data  word  can  be  retrieved  from  the  XROM 
in  45  ns  for  a  3  micron  process  jFr86j 
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4. 2. 2. 2.  Address  Sequencer. 
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The  address  sequencer  controls  the  generation  of  the  address  of  the  next  stored 
microinstruction  to  be  executed.  The  address  sequencer  consists  of  an  address  multi¬ 
plexer,  a  control  address  register,  an  incrementer,  a  subroutine  return  register,  a 
branch-test  multiplexer,  and  branch  logic,  as  shown  in  Figure  3-4. 

The  address  multiplexer  is  a  four  to  one  multiplexer  that  selects  a  nine-bit  address 
and  sends  the  selected  address  to  the  control  address  register,  as  shown  in  Figure  4-5. 
The  design  shown  is  obtained  with  fifty  four  transmission  gates.  Segregating  the  input 
busses  as  shown  minimizes  the  amount  of  bus  crossover. 

The  control  address  register  is  implemented  with  nine  master-slave  flip-flops  which 
in  turn  consist  of  two  D  flip-flops. 
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The  incrementer  is  a  combinational  logic  circuit  constructed  by  cascading  a  series 
of  half-adder  circuits.  One  input  of  the  least  significant  stage  must  always  be  a  "one”. 
The  carry  from  each  stage  is  connected  to  one  of  the  inputs  of  the  next  stage.  A  half- 
adder  can  be  constructed,  using  CITGL,  with  seven  transistors,  as  shown  in  Figure  4-6 

The  subroutine  return  register  consists  of  a  last-in,  first-out  (LIFO)  register  stack. 
The  address  lines  are  input  to  a  series  of  MSFFs.  The  outputs  of  each  MSFF  has  a 
feedback  loop  to  the  input  of  the  previous  MSFF.  The  feedforward  path  is  enabled  by 
the  PUSH  signal  and  the  feedback  path  is  enabled  by  the  POP  signal.  An  empty  full 
stack  detector  is  by  constructing  an  additional  column  in  the  stack.  A  FULL  signal  is 


generated  by  ANDing  the  output  of  the  deepest  stage  with  the  PUSH  signal  An 


EMPTY  signal  is  generated  by  ANDing  the  output  of  the  first  stage  with  the  POP  sig¬ 


nal.  The  SBR  is  shown  in  Figure  4-7. 


The  branch-test  multiplexer  is  the  same  circuit  as  the  32-bit  address  selection 


decoder  except  the  roles  of  input  and  output  are  reversed. 


The  branch  logic  is  implemented  with  five  two  input  AND  gates,  an  OR  gate,  and 


an  inverter,  as  shown  in  Figure  4-8. 
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4.3.  Memory 


The  memory  chip  circuits  were  presented  in  Chapter  3.  The  main  components 
are:  the  RAM  cell,  the  read  circuitry,  the  write  circuitry,  the  error  detection  and  correc¬ 
tion  circuitry,  and  the  switching  circuitry. 

4.3.1.  RAM  Cell  A  cifplot  of  the  six  transistor  RAM  cell  is  shown  Figure  4-9.  The 
following  sizes  assume  the  use  of  the  1.5  micron  process.  The  RAM  cell  is  42  micrometers  high 
by  67.5  micrometers  wide  with  stepping  distances  of  42  and  55.5  micrometers.  The  size  of  the 
RAM  array  is  7104  micrometers  by  11424  micrometers  for  a  dual  bank,  272  x  64-bit  word 
memory. 

4.3.2.  Read  Circuitry 

The  read  circuitry  consists  of  the  pull  up  transistors  and  the  sense  amplifier. 

A  CIFplot  of  the  pull  up  transistors  are  shown  in  Figure  4-10.  Because  the  pur¬ 
pose  of  the  pull  up  transistors  is  to  not  only  raise  BITLINE  and  BITLJNE’  but  also,  to 
ensure  that  they  are  the  same  by  shorting  the  two  lines  together,  a  very  compact  cell 
was  implemented.  The  active  region  of  P3  connects  the  source  of  one  transistor  to  the 
source  of  the  other  transistor. 

The  sense  amplifier  was  designed  to  be  arrayed  at  the  same  pitch  as  the  RAM 
array.  In  a  SPICE  simulation  of  this  two  stage  sense  amplifier,  the  output  inverter 
detected  a  differential  voltage  of  0.01  volts.  The  amount  of  time  necessary  to  read  a 
RAM  cell  is  directly  proportional  to  the  magnitude  of  the  differential  voltage  needed  for 
detection 


4.3  3.  Write  Circuitry. 

The  write  circuitry  is  implemented  with  two  NANI)  gates  and  two  NOR  gates 


The  write  circuitry  is  arrayed  at  the  same  pitch  as  the  RAM  array 
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Figure  4-10  READ  PULL  UP  TRANSISTORS 


4.3.4.  Error  Detection  and  Correction  Circuitry. 

The  error  detection  detection  circuitry  consists  of  a  parity  bit  generation  circuit,  a 
syndrome  bit  generation  circuit,  and  an  error  pattern  detecting  circuit  The  CIFplot  of 
a  typical  parity  bit  generator  circuit  is  shown  in  Figure  4-11  The  design  of  the  syn¬ 
drome  bit  generator  circuit  is  the  same  as  the  parity  bit  generator  circuit  except  that 
seven  inputs  are  combined  for  each  output.  The  error  bit  generator  is  an  eight  input 
24  output  PLA,  as  shown  in  Figure  4-12. 

4.3.5.  Address  Selection  Circuitry 

The  address  selection  circuitry  consists  of  the  decoder  and  the  wordline  driver 
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The  decoder  is  a  NAND  PLA  with  snaked  gates  to  minimize  the  pitch  of  the  cells  The 
pitch  of  the  decoder  and  the  wordline  driver  is  half  the  pitch  <>f  the  RAM  array 


CHAPTER  5 


CONCLUSIONS  and  RECOMMENDATIONS 


5.1.  Conclusions 

This  thesis  effort  produced  the  design  of  the  PFA  Controller  which  will  operate  at 
a  clock  rate  of  20  MHz.  The  RISC  implementation  possesses  an  advantage  over  other 
methods  in  flexibility  and  extensibility  of  the  design.  The  RISC  architecture  of  the  con¬ 
troller  is  simple  to  implement.  The  instruction  set  is  limited  to  the  necessary  micro¬ 
operations.  If  the  control  requirements  change,  the  microprogram  can  be  changed  by 
changing  the  personalization  of  the  control  memory  ROM.  In  addition,  the  design  of 
the  RISC  can  be  used  for  control  in  many  diverse  situations.  Since  the  RISC  architec¬ 
ture  minimizes  hardware  and  software,  the  complexity  of  the  design  effort  is  also 
minimized.  Also  the  PFA  pipeline  memory  was  designed  with  performance  comparable 
to  that  of  commercially  available  state-of-the-art  CMOS  RAMs.  The  worst  case  access 
times  should  be  less  than  28  ns. 

The  realization  of  the  PFA  pipeline  controller  has  its  primary  importance  in  the 
contribution  it  makes  toward  achieving  the  overall  goal  of  implementing  a  PFA  proces¬ 
sor  capable  of  computing  4080-point  DFTs  at  a  rate  of  over  8300  Hz  (Taylor.  19851 
while  incorporating  a  high  degree  of  fault  tolerance  into  the  system 

5.2.  Recommendations 

The  following  recommendations  are  proposed  regarding  future  action: 

1.  The  PFA  controller  should  be  implemented  using  the  3  micron  process  to 
enable  performance  evaluations.  Extensive  test  results  are  required  before  the  PFA  con¬ 
troller  is  submitted  for  fabrication  in  the  12  micron  process 
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Figure  5-1  MEMORY  EVALUATION  CHIP 


2.  The  memory  chip  should  be  implemented  using  the  3  micron  process  to  evalu¬ 
ate  the  performance  of  the  circuits.  However,  a  3  micron  fabrication  necessitates 
decreasing  the  size  of  the  memory  array.  The  memory  could  be  fabricated  as  a  set  of 
four  chips,  two  for  the  real  numbers  and  two  for  the  imaginary  numbers  The  address 
decoder  would  have  the  same  number  of  address  lines  but  the  address  lines  would  be 
broadcast  to  four  chips,  two  chips  for  the  real  numbers  and  two  chips  for  the  imaginary 
numbers.  A  typical  chip  is  shown  in  Figure  5-1.  This  arrangement  would  degrade  the 
speed  of  operation  but  would  permit  testing  of  the  functionality  of  the  pipeline  struc¬ 
ture. 

3.  After  the  actual  performance  of  the  data  switching  circuits  and  error 
detection-correction  circuits  are  evaluated,  a  dynamic  RAM  memory  approach  should 
be  pursued  to  decrease  the  size  of  the  memory  chips 

4  To  increase  the  ECC  decoders  speed,  use  two  input  OR  gates  in  lieu  of  the 
orplane  shown  in  Figure  4-12. 
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